Lifelong Machine Learning Potentials

Machine learning potentials (MLPs) trained on accurate quantum chemical data can retain the high accuracy, while inflicting little computational demands. On the downside, they need to be trained for each individual system. In recent years, a vast number of MLPs have been trained from scratch because learning additional data typically requires retraining on all data to not forget previously acquired knowledge. Additionally, most common structural descriptors of MLPs cannot represent efficiently a large number of different chemical elements. In this work, we tackle these problems by introducing element-embracing atom-centered symmetry functions (eeACSFs), which combine structural properties and element information from the periodic table. These eeACSFs are key for our development of a lifelong machine learning potential (lMLP). Uncertainty quantification can be exploited to transgress a fixed, pretrained MLP to arrive at a continuously adapting lMLP, because a predefined level of accuracy can be ensured. To extend the applicability of an lMLP to new systems, we apply continual learning strategies to enable autonomous and on-the-fly training on a continuous stream of new data. For the training of deep neural networks, we propose the continual resilient (CoRe) optimizer and incremental learning strategies relying on rehearsal of data, regularization of parameters, and the architecture of the model.


INTRODUCTION
For the prediction and understanding of the properties and reactivity of atomistic systems, the knowledge of the potential energy surface is inevitable. The potential energy surface is given by the electronic energy as a function of the nuclear positions and can be obtained from electronic structure methods such as density functional theory (DFT) or approximate wave function theories. 1,2 These methods are generally applicable and achieve accurate results for many systems, but even state-of-the-art approaches still lead to high computational demands in extended simulations. 3−5 Instead of explicitly calculating the electronic energy for each nuclear conformation in Born−Oppenheimer approximation, empirical force fields rely on approximate (simple) analytical expressions of the potential energy surface. Hence, they avoid the explicit integration of quantum mechanical equations and thereby enabling efficient atomistic simulations. However, force field expressions are of limited accuracy and they are typically not universal for all chemical systems. 6−9 By contrast, machine learning potentials (MLPs) 10−15 can preserve the high accuracy of electronic structure methods but at low computational cost comparable to that of force fields. MLPs are not based on physical approximations but rely on very flexible mathematical expressions to obtain an analytical potential energy surface. The parameters of these expressions are trained on electronic structure reference data including chemical structures and their respective energies and atomic forces. While first-generation MLPs were only applicable to low-dimensional systems, 16 the introduction of secondgeneration MLPs in the form of high-dimensional neural network potentials (HDNNPs) 17−19 has facilitated atomistic simulations of systems with tens of thousands of atoms on nanosecond time scales with an accuracy similar to that of firstprinciples methods. In recent years, various other types of MLPs have been proposed such as neural network potentials, 20−23 Gaussian approximation potentials, 24 moment tensor potentials, 25 and many more. Second-generation MLPs rely on the locality of the major part of atomic interactions. 19 However, third-generation MLPs add the description of longrange interactions 24,26 and fourth-generation MLPs even nonlocal interactions 27,28 beyond the applied cutoff sphere for atomic interactions.
Reference atomistic structures need to represent sufficiently well the conformation space to be explored in subsequent simulations. The reason is that MLPs are designed for interpolation of learned atomic interactions, but the extrapolation capability far beyond the trained atomic environment space is limited. 12,18 For example, an MLP trained on water will fail for gaseous dihydrogen and dioxygen. A systematic construction approach is not feasible for most systems because of the enormous conformation space that is typically accessible for an atomistic system, especially when one considers various large-amplitude motions or chemical reactions. To obtain reasonable reference conformations, iterative active learning protocols have been devised. 29−33 In such protocols, preliminary MLPs are validated in atomistic simulations to identify important yet unknown conformations. These conformations are recalculated with the reference method and then added to the reference data to train an improved MLP. However, it is difficult to ensure that a reference data set is complete even for a specific application because some conformations may occur only very rarely. Also, different scientific targets and purposes may, for the same system, highlight different regions of its conformation space and, therefore, require different reference conformations.
Current MLPs learn in isolation, i.e., a model is trained on predefined data and is subsequently applied in simulations without exploiting for future MLPs the knowledge already learned. Despite the general functional form, MLPs can therefore only be applied reliably to a specific conformation space�which can span various chemical formulas and configurations�and the transferability significantly beyond their learned atomic environment space is limited. 8,9 As a consequence, a vast number of single-purpose MLPs have been constructed. 34−42 Recent work has attempted to overcome this limitation by leveraging very large and broad training data sets, 43,44 but future applications may still require different reference data.
To achieve a general-purpose solution, MLPs need to be adaptable to learn new chemical systems based on already acquired knowledge. Obviously, the training data sets of present-day MLPs can be simply extended to include additional systems in active learning procedures. However, the training process of new or extended data sets is often started from scratch with randomly initialized MLP parameters. Even if previously obtained parameters are used, the training typically does not differentiate between new and old data points. Hence, all data points need to be trained simultaneously from a stationary batch of data every time new data is added. Consequently, learning larger data sets becomes increasingly inefficient as the amount of repeated training of old data grows. This trend is opposite to how the human brain operates, which is by efficiently learning additional knowledge based on already acquired expertise. 45−47 This ability of incremental learning is referred to as lifelong learning or continual learning in the field of machine learning, 46,47 and it is also desirable for MLPs. Lifelong machine learning means that the model continually acquires and fine-tunes knowledge. For MLPs it will be a singleincremental-task scenario because new chemical structures are sequentially added, but the energy needs to be distinguished among all encountered structures. 48 Lifelong learning is a challenge for machine learning because incremental learning from a continuous stream of information typically results in catastrophic forgetting and interference. 49,50 The MLP parameters are adapted to the new data, while the memory about the old data fades away.
In principle, building on prior knowledge could even reduce the necessary amount of training data and more complex tasks could be learned more efficiently. According to complementary learning systems theory, 47 learning is based on episodic memory and generalization. While the episodic memory learns arbitrary information fast, the generalization is a slow learning process to structure the knowledge. Therefore, the episodic memory is similar to the training data set, while the MLP parameters are trained to generalize the data. Consequently, the parameters need to be plastic to integrate new knowledge, but, at the same time, they need to be stable to avoid forgetting. 45−47 In recent years, several algorithms have been developed to reach this stability−plasticity balance and to mitigate catastrophic forgetting; examples are ExStream, 51 gradient episodic memory, 52 synaptic intelligence, 53 AR1, 48 progressive neural networks, 54 and growing dual-memory. 55 These approaches rely on rehearsal of selected training data, regularization of machine learning parameters, and/or the architecture of the machine learning model.
The situation becomes more complex for a general-purpose or universal MLP as it would be further hindered by the unfavorable scaling in computational cost with the number of chemical elements. This feature plagues most common MLP descriptors 9,56,57 such as atom-centered symmetry functions (ACSFs), 58 smooth overlap of atomic positions, 59 and many more 60,61 �with exception of graph representations. 62−64 The reason is a rapid increase in the number of structural descriptors since these are constructed independently of one another for each element combination. Some studies have attempted to overcome this problem by combining element and structure information in the structural descriptor through the inclusion of a single additional element-dependent weight. 65−67 Despite some successful applications, employing, for example, the atomic number as in weighted ACSFs 67 is restricted to elements of similar atomic numbers. Otherwise, the contributions of light atoms, such as H, are obscured by contributions of heavy atoms, such as iodine (e.g., factor 53 · 53 in angular contributions).
In this work, we exploit additional weight terms based on different properties for separate ACSFs that can reduce bias toward specific elements and increase the discriminability of the descriptor representation for different conformations. Chemical intuition is based on regularities in molecular structure and on trends in the periodic table. To exploit these trends, we use information about the period and the group, 68 or more precisely, the position in the s-, p-, d-, and fblock instead of the atomic number. For instance, all halogens lead to a similar type of bonding, while the bond length increases in higher periods.
Furthermore, we propose that uncertainty quantification can enable the application of adaptable MLP parameters. Adaptable parameters require that the reliability of an MLP is probed on the fly to avoid full validation of each MLP training extension. As long as the error of the MLP is below a tolerance in production calculations, little variations with the learning process will not significantly affect the results. Moreover, MLPs with uncertainty quantification are able to report warnings in case of unknown conformations, even if these are within the general training space. Therefore, the uncertainty quantification can be employed to estimate the transferability of the MLP, with low uncertainty meaning high transferability. To obtain uncertainty quantification, for example, for HDNNPs, some studies have applied an ensemble approach. 69−73 A small ensemble of HDNNPs is then trained differently and independently on the same reference data. The deviations between the predictions of the individual HDNNPs can be employed as a proxy to the prediction error due to the high flexibility of the neural networks. Moreover, taking the average of the ensemble improves the accuracy of the representation. We note that such uncertainty measures should accompany any sort of modeling approach in atomistic simulations, 74 although their implementation has only started recently.
Consequently, this work introduces element-embracing atom-centered symmetry functions (eeACSFs) to overcome the unfavorable scaling with the number of elements up to an arbitrary number. In combination with uncertainty quantification and continual learning strategies, the concept of a lifelong machine learning potential (lMLP) is proposed, which can be trained in a rolling fashion by a continuous stream of new data. To perform this challenging training process, we present the new continual resilient (CoRe) optimizer which is an adaptive method for stochastic first-order iterative optimization. CoRe is applied in combination with our lifelong training strategies, which provide an adaptive selection of employed training data, including a reduction of the data set size and the removal of doubtful data. We demonstrate the performance of our lMLP concept for a data set including 42 different S N 2 reactions and thereby ten different chemical elements. Although our work on the lMLP concept rests on a second-generation HDNNP representation, we emphasize that the concept can also be applied for a different base model. Furthermore, the CoRe optimizer and the lifelong adaptive data selection can be employed in training of machine learning models beyond lMLPs and contribute to the development of lifelong machine learning to mitigate catastrophic forgetting.
This work is organized as follows: In Section 2 we summarize the HDNNP method and introduce eeACSFs, the CoRe optimizer, lifelong adaptive data selection, uncertainty quantification, and lMLPs. After presenting the computational details in Section 3, we move to Section 4, which starts with a description of the reference data. Section 4 continues with a performance assessment of the eeACSF representation, a comparison between the results of CoRe and those of other optimizers, performance tests of the lifelong training strategies, and a validation of the uncertainty quantification in potential energy surface predictions. This work ends with a conclusion in Section 5.

High-Dimensional Neural Network
Each atomic energy contribution is obtained from a feedforward neural network, The weights a and b are trained individually for each element m according to energies and atomic forces of a training data set containing multiple chemical structures. We note that the number of hidden layers can differ from the case shown, which is for two hidden layers with n 1 and n 2 neurons each. A vector of local structural descriptors G n m of dimension n G , which is explained in the next section, is given as input to this atomic neural network. We propose the activation function f(x) = 1.59223 · tanh(x), which is discussed in Section S1.1 of the Supporting Information along with a tailored weight initialization scheme. 40,75 The convergence advantage and accuracy increase of this activation function and weight initialization compared to a hyperbolic tangent and a nontailored weight initialization is shown in Table S3 and Figures S1 (a), S1 (b), and S2 in the Supporting Information.
In contrast to the original HDNNP method 17 the trainable weights α i m and β i m are introduced for standardization of the structural descriptor input G n,i m . For this purpose, the structural descriptor is shifted by β i m , which is initialized by the mean of the descriptor values G n,i m in the initial training data set for all atoms n of the respective element m. The result is scaled by α i m , which is initialized by the inverse standard deviation of the respective descriptor values in the initial training data. In this way, the weights a iκ m,01 of the input layer are multiplied with values, which are centered around zero and show a standard deviation of one for the initial training data. This standardization can improve the training performance and is adjustable in case of additional training data. We note that the weight initialization can be restricted to values inside a certain interval to avoid numerical issues. The weight pair a iκ m,01 and α i m may be combined, but we treat them separately for technical reasons during the optimization.

Element-Embracing Atom-Centered Symmetry Functions.
HDNNPs usually employ vectors of many-body atom-centered symmetry functions 58 to represent the local atomic environments. These descriptors fulfill the translational, rotational, and permutational invariances of the potential energy surface. They depend on distances and angles of all neighboring atoms which are inside a cutoff sphere of radius R c . This radius needs to be sufficiently large to account for all relevant interactions. Since no connectivities are used, the HDNNP is able to describe chemical reactions. The dimensionality of the input vector does not depend on the individual atomic environment�a requirement to obtain generally applicable atomic neural networks. Conventional ACSF vectors are constructed in such a way that each ACSF represents only all interactions between a specific chemical element pair or triple. This construction, however, leads to an unfavorable scaling in the number of descriptors with respect to the number of elements.
For this reason, we introduce element-embracing atomcentered symmetry functions (eeACSFs) that explicitly depend on the element information H from the periodic depend in addition on the angle θ njk between atom n and the two neighbors j and k. Different values for the parameters η i rad ⩾ 0, η i ang ⩾ 0, λ i = ± 1, and ξ i ⩾ 1 of each eeACSF i eventually produce a structural fingerprint of the atomic environment of atom n. The cutoff function, l damps the contributions smoothly to zero beyond the cutoff radius R c . The advantage of this cutoff function is that also the derivatives of all orders are zero at R c , which is beneficial for the calculation of forces, normal modes, and so forth. In contrast to ACSFs, a square root is applied to eeACSFs to mitigate a too strong effect of the number of neighbors on the eeACSF value. Such a strong effect can be observed for training data including very different molecule sizes and particle densities, and it can decrease parametrization performance due to a too broad range of input values. For the construction of radial eeACSFs without element dependence, H i,j rad = 1 can be applied. In this work, the maximum period is set to X − 1 = 5, i.e., elements up to xenon are considered. The descriptors n, m, and d are employed to balance contributions of light and heavy elements as well as of elements with few and many valence electrons. For main group elements d = 0 is used and for d-block elements m = 0. For elements of higher periods, the group in the f-block can be implemented in the same way as for the d-block and X − 1 can be set to 7. To keep the contribution of each interaction to a value between 0 and 1, the radial eeACSF is divided by H max,i rad , which is the maximum possible value of H i,j rad . The element-dependent terms H i,jk ang of the angular eeACSFs are calculated as linear combinations, As a consequence, the element-dependent prefactor of the angular eeACSF is defined as In conclusion, for systems with/without d-block elements five/seven different H i,j rad and nine/eleven H i,jk ang terms are required. In combination with typically around five η i rad , two η i ang , two λ i , and three ζ i parameter values, the descriptor vector will consist of 25/35 radial and 108/132 angular eeACSFs independent of the number of elements. By contrast, the number of radial ACSFs is proportional to the number of elements and the number of angular ACSFs scales with ∑ m = 1 N elem m. For example, in the case of four elements, 20 radial and 120 angular ACSFs are obtained for the same number of parameter values. Hence, at around four elements is the breakeven point of computational cost, since the additional effort for determining H i,j rad and H i,jk ang from the listed element-dependent values is small. We note that the number of different parameters values can affect the resolution of the representation. The computational bottleneck in the descriptor calculation during MLP applications is the determination of the derivatives as a function of the atomic positions to obtain the atomic forces. Since no additional position-dependent properties are included in eeACSFs compared to ACSFs, the computational cost per eeACSF is similar to that of ACSFs.

Training of High-Dimensional Neural Network Potentials.
To optimize the weights of the atomic neural networks with respect to potential energies E ref,r and Cartesian atomic force components F α,n ref,r of reference conformations r, a loss function is defined: A set of N conf conformations is trained simultaneously in each training epoch t to accelerate the optimization and reduce overfitting of single data points. Since the HDNNP prediction of a conformation can be dependent on the atomic neural networks of different chemical elements, these networks are also trained simultaneously. To balance the contributions of energies and forces to the loss function, the hyperparameter q is used. This hyperparameter needs to be chosen with care as it can significantly affect the training performance. To optimize the weights, in each training epoch the gradient of the loss function with respect to the weights, , , is calculated, where ξ is a unique index of each weight.

Continual Resilient (CoRe) Optimizer.
To improve the training process, we developed the continual resilient (CoRe) optimizer that aims to combine the robustness of resilient backpropagation (RPROP) 76,77 with the performance of the Adam optimizer. 78 Moreover, we introduce adaptive decay rates of moving averages of the loss function gradients, plasticity factors of the weights obtained from an importance score, and weight decays bounding the weight values to increase the convergence speed and final accuracy beyond state-of-the-art optimizers.
CoRe is a first-order gradient-based optimizer. It employs individual adaptive learning rates for each weight w ξ , which depend on optimization history. The algorithm is intended for stochastic iterative optimizations, i.e., subsamples of the batch of training data are used in each training epoch. Thus, computational efficiency benefits of stochastic gradient decent (SGD) 79 can be exploited.
In the spirit of the Adam optimizer, exponential moving averages of the gradient, 13) and the squared gradient, with decay rates β 1 τ , β 2 ∈ [0, 1) are activated in the computation of the individual adaptive learning rates of each weight w ξ t . τ is the optimization step counter for each weight, which can be different from the training epoch t. For example, for HDNNPs a training data subsample of a training epoch may not contain every chemical element of the entire training data set leading to weight updates only for some of the atomic neural networks in the respective training epoch. In this way, the HDNNP model already provides an architectural strategy to mitigate catastrophic forgetting. We note that Equations 13 and 14 represent the case of minimization, while for maximization the sign of the loss function derivative with respect to the weight must be inverted.
In contrast to the Adam optimizer, β 1 is a function of τ, Ä The hyperparameters β 1 a , β 1 b ∈ [0,1) define the initial and final values of β 1 . These values are interconverted by a Gaussian with hyperparameter β 1 c > 0. A larger β 1 value increases the dependence on previous gradient information, that is, on the optimization history of all recent training data subsamples. Thereby, the optimization performance with respect to the entire training data set can be improved. A smaller β 1 value yields a higher dependence on the current gradient of the subsample, decreasing in some way the moment of inertia of the optimization process. The latter is beneficial for rapidly and strongly changing gradients occurring in fast convergence at the beginning of an optimization, which is started from randomly initialized weights. In conclusion, Equation 15 can be employed to increase β 1 in the course of the optimization to improve the training performance.
One contribution of the weight update magnitudes of CoRe is obtained, in analogy to the Adam optimizer, according to The moving averages g ξ τ and h ξ τ get bias corrected by 1−(β 1 τ ) τ and 1−(β 2 ) τ , respectively, to counteract their initialization bias toward zero (g ξ 0 , h ξ 0 = 0). The division of the two bias-corrected moving averages makes the weight update magnitudes invariant to gradient rescaling. A form of step size annealing is obtained, because a decreasing gradient value leads to a decrease of the update u ξ τ . Therefore, for well-behaving optimizations the absolute value of u ξ τ typically decreases from ±1 in the first optimization step τ = 1 toward zero. Higher values of the hyperparameter β 2 promote this annealing. The hyperparameter ϵ ≳ 0 is added for numerical stability.
As a second contribution to the weight update, the plasticity factor, is introduced to mitigate forgetting of old information. In the initial training phase, this factor equals one. When τ > t hist , with hyperparameter t hist > 0, the plasticity factor can freeze the values of some weights by setting P ξ τ to zero. The selection of these weights depends on a score value S ξ τ−1 , which ranks the importance of the weights with regard to previously predicted loss function decrease, as will be explained in Equation 20. The score value S ξ τ−1 is compared to all other score values S χ m,μν,τ−1 of the same group. For HDNNPs, a group is composed by the weights within the atomic neural network of the same element m, with the same weight type χ, and the same layer assignment μν. The weights belonging to the n frozen highest score values in their respective group are frozen for the optimization step τ. The hyperparameter n frozen,χ m,μν ⩾ 0 can be set individually for each group of weights. In this way, a regularization is established for weights with the highest estimated importance.
A further contribution of the step size adjustment is adapted from RPROP, In general, the step size s ξ τ does not depend on the magnitude of the gradient but only on its sign, yielding a robust optimization. Every time g ξ τ changes its sign compared to g ξ τ−1 , the previous optimization step probably jumped over a local minimum. Then, the previous step size s ξ τ−1 was too large and needs to be decreased. If the signs of g ξ τ and g ξ τ−1 are the same, s ξ τ−1 can be increased to speed up convergence. The decrease factor η − and increase factor η + , with 0 < η − ⩽ 1 ≤ η + , are applied to obtain the current step size s ξ τ , whereby s ξ τ is bounded by minimal and maximal step sizes s min ,s max > 0. If P ξ τ , g ξ τ , and/or g ξ τ−1 are zero, the step size update is omitted. The step size s ξ 0 needs to be initialized determining the first optimization step size s ξ 1 . The value can be chosen in reasonable proportion to the initial weight values, and experience has shown that the precise choice of this parameter is rather noncritical due to the fast adaption. We note that the gradient is not reset to g ξ τ = 0 for g ξ τ−1 · g ξ τ < 0, in contrast to some RPROP variants that include a backtracking weight step. 76,80 As a consequence, the history of g ξ τ can be retained.

Journal of Chemical Theory and Computation pubs.acs.org/JCTC Article
To decrease the risk of overfitting, the weight update, includes a weight decay with weight decay hyperparameter d χ m,μν ∈ [0, (s max ) −1 ) . The weight w ξ t−1 is reduced by the fraction obtained from the product of d χ m,μν and the absolute current weight update |u ξ τ | · P ξ τ · s ξ τ . We note that the sign of the weight update�equal to the sign of g ξ τ �is only encoded in u ξ τ . The inverse weight decay hyperparameter is the maximal absolute weight value in well-behaving optimizations, i.e., u ξ τ ≤ ± 1, preventing strong increases or decreases of weights. To obtain the new weight w ξ t , the current weight update u ξ τ · P ξ τ · s ξ τ is subtracted from the previous weight w ξ t−1 . Hence, the previous weight is changed in the opposite direction than the sign of g ξ τ with an individually adapted learning rate. The score value, accounts for weight specific contributions to previous loss function decreases. It is inspired from the synaptic intelligence method. 53 The loss function decrease is estimated by the product of the moving average of the gradient g ξ τ and the weight update u ξ τ · P ξ τ · s ξ τ for each step τ. For infinitesimally small changes in the opposite direction of the gradient, the product of gradient and change equals the respective loss function decrease. For larger updates in optimizations, as in the present case, the loss function decrease is typically overestimated by this product, but still reasonable. In addition, the gradient will be noisy if it is calculated for a subsample of the entire training data set. The sign inversion of the update is omitted in the calculation of the score value. Consequently, a higher positive score value represents a larger loss function decrease. The score value is initialized as S ξ 0 = 0. For each τ ≤ t hist , g ξ τ · u ξ τ · P ξ τ · s ξ τ is summed with equal contribution (t hist ) −1 to build an initial history. Afterward, S ξ τ is calculated as an exponential moving average with decay parameter 1 − (t hist ) −1 . The score value identifies the most important weights for the accurate prediction of previous training data. These weights can then be restricted by the plasticity factor (Equation 17), to balance the stability−plasticity ratio.
As a variant, CoRe can also use the sign of the moving average of the gradient g ξ τ as update factor u ξ τ , i.e., u ξ τ = sgn(g ξ τ ) . Furthermore, CoRe can be employed without plasticity factors, i.e., n frozen,χ m,μν = 0. The Adam optimizer is a special case of CoRe for the hyperparameter settings were employed in this work, while the total number of epochs were 1500, 2000, and 2500. In addition, n frozen,χ m,μν was set to 1% of the number of weights a κλ m,μν and b λ m,ν in their respective group, except for those weights associated with the output neuron. β 1 a and β 1 b need to be smaller for rapidly and strongly changing gradients. For example, β 1 a = 0.45 and β 1 b = 0.7,0.725 were used in this work. We note that for highly uncorrelated training data an increase of the subsample size can be beneficial to reduce gradient fluctuations. 2.5. Lifelong Adaptive Data Selection. Stochastic optimization, which is gradient calculation on a subsample of the training data in each epoch, can retain the computational demand on a manageable level for large data sets. However, random subsamples may lead to an inefficient approach because various data are typically represented at different levels of accuracy. This issue will be even more severe if additional training data are added during the training process in lifelong machine learning. Moreover, redundant and incorrect data need to be removed during the training process to be able to learn autonomously from a continuous stream of new data. Still, rehearsal of representative old data is very important to avoid catastrophic forgetting.
To find a solution for these issues, we developed an algorithm for lifelong adaptive data selection. The first key ingredient of the algorithm is an adaptive selection factor S hist r for each training conformation r depending on the conformation's loss function contribution and the training history. The second key ingredient is a balancing scheme for the stochastic choice of good and bad represented training data. The algorithm intends to exclude redundant and incorrect data and to improve the training performance by balancing adjustment to new or bad represented data and rehearsal of good represented data.
The lifelong adaptive data selection algorithm can be subdivided into a part carried out before the loss function calculation and a part after this calculation. In the first part (Algorithm 1) the training data subsample is chosen for which the loss function and the respective gradients are calculated. This subsample is indicated by the index "fit" as these data are used for fitting the weights. The number of conformations to be fitted per epoch N fit is a training hyperparameter and can be chosen based on the data set size and the correlation between the data points. Since conformations can be excluded by an adaptive selection factor S hist r (Algorithm 2), N fit cannot be larger than the number of conformations with S hist r > 0, i.e., len (S hist >0 ) . The training data subsample is split into a set of N good already well represented conformations and N bad new or insufficiently represented conformations. The fraction of good data is defined by p good (Algorithm 2), while N good has to be an integer. p good is initialized as 0.

pubs.acs.org/JCTC Article
To determine the probabilities P bad and P good for conformations to be chosen as bad or good data, respectively, the contributions of the conformations to the loss function are employed. The last calculated contributions L old (or Not a Number (NaN) if no contribution was calculated for a conformation so far) are divided by the maximal, non-NaN contribution L old max . These quotients are multiplied by the conformation specific adaptive selection factors S hist to yield the probabilities P bad . S hist is initialized as unity vector. All NaN components of P bad are set to the maximum of S hist . For normalization P bad is divided by the sum of its components. A random choice of N bad conformations with probabilities P bad is selected from the training data set D to be part of the training data subsample.
To obtain N good good data from the remaining data set D \fit , the probability P good is determined. P good is proportional to the difference between a unity vector 1 and the quotient of L old \fit and L old max . The minimum nonzero probability of P good is determined and multiplied by ϵ′ ≳ 0 to obtain a very low probability P good min for nonexcluded but unfavored conformations. P good entries are set to P good min if the loss function contribution of the corresponding conformation is equal to the maximal contribution or if the P good entry is NaN. Subsequently, P good is divided by the sum of its contributions. The good data subsample is a random choice of N good data from D \fit with probabilities P good . The union of the selected bad and good data is then employed as training data subsample to calculate the loss function.
To update the adaptive selection factors S hist , the new loss function contributions L new fit of the conformations in the training data subsample are used (Algorithm 2). Therefore, a weighted sum of energy loss L E fit and force loss L F fit is employed, which is divided by the number of atoms of the respective conformations N atom fit . L rel , which is the quotient of L new fit and the total loss L new fit of the training epoch as defined in Equation 10, is compared to different thresholds T F 1 < 1 < T F 2 < T F 3 < T X to update S hist and the exclusion strike counter X. The latter is introduced to exclude prediction outliers. Under the assumption that the model is reasonable, outliers are presumably incorrect data points. Therefore, X�initialized as zero vector�counts how many consecutive evaluations L rel r are greater than the threshold T X . If X r reaches the hyperparameter N X , the adaptive selection factor S hist r is set to zero and the respective conformation is thus excluded from further training.
The adaptive selection factors S hist are further modified depending on the following conditions: If L rel r ≥ T F 1 and S hist r < 1, S hist r is set to one. If L rel r ≤ T F 2 and S hist r > 1, S hist r is also changed to one. In this way, S hist r of well represented data (L rel r < T F 1 ) can be lower than one leading to lowering of the probability to be chosen for the training data subsample. On the other side, S hist r of bad represented data (L rel r > T F 2 ) can be greater than one increasing the selection probability. For  values below the S min are set to zero for data set reduction because the associated conformations are steadily well represented by the model. Thus, these data are presumably redundant and can be excluded since rehearsal of other data and/or other learning strategies are sufficient to not forget these conformations. All S hist r values above the S max are also set to zero because the model was not able for many optimization steps to represent these conformations accurately. Therefore, under the assumption that the model is reasonable, these conformations are not consistent to the major part of the other data and can be removed from the training data. In this way, the training performance improves for the other data.
For adaptive balancing of learning bad represented data and not forgetting good represented data, the fraction of good data p good is modified according to the change of the total loss L new fit − L old fit . p good is increased by p ± if the loss change is positive, while it is decreased by p ± for negative loss change. L old is initialized as infinity and p ± is defined as p good max · (N p ) −1 , with the maximal p good value p good max and the total number of possible p good values N p . Lastly, the new loss function contributions L new fit overwrite the old loss function contributions L old fit for the conformations of the training data subsample and the new total loss value L new fit replaces the old total loss value L old fit .

Uncertainty
Quantification. An lMLP requires the prediction of the uncertainties of the energy ΔE and the atomic force components ΔF α,n in addition to their values for a chemical structure. The reason is that a prerequisite of an lMLP is that it can produce results at every training stage. Since the predictions at different training stages can vary, the lMLP needs to provide a measure of the prediction accuracy. Then, the predictions can be compared within their accuracy in which they should agree. In this way, a reliable method is obtained that still can adapt to new information over time. We note that uncertainty quantification also significantly advances other computational chemistry methods because overinterpretation of features beyond the method's accuracy can be prevented.
The uncertainty of machine learning models can be separated into contributions from noise in the reference data, bias of the model, and model variance contributions. 81 The handling of a noisy training data is addressed by the lifelong adaptive data selection, while noise of the test data does not influence the model performance but only the performance evaluation. For correctly converged electronic structure calculations the noise is determined by the convergence thresholds and thus small for MLP reference data. Model bias is caused by limitations in the model architecture, representation by the descriptors, and reference data set size. Therefore, the model architecture needs to be optimized for different applications and elaborate descriptors need to be tested to keep this error contribution small. Lifelong learning addresses the remaining part of incomplete reference data. Model variance error can be reliably reduced by ensemble (or committee) models and its size can be predicted by statistics. The respective uncertainty estimation is based on the huge flexibility of the mathematical expressions underlying machine learning models like artificial neural networks. For well trained conformation space the ensemble member predictions presumably agree, while the predictions are arbitrary for unknown conformation space. Therefore, they can spread largely for an ensemble of independently and differently trained MLPs, i.e., using, for example, different initial weights.
Consequently, ensembles of MLPs have been used for a straightforward uncertainty quantification of model variance contributions. [69][70][71][72][73]82 The final energy prediction E is obtained by the mean of the ensemble of N MLP MLP energies E p , Analogously, the atomic force components F α,n are calculated as mean of the negative energy gradient with respect to the Cartesian coordinates α n of an MLP ensemble, while the minimal uncertainty is defined by the root-meansquare error RMSE(E) of the reference data. The atomic force component uncertainties ΔF α,n can be analogously calculated. The scaling factor c is introduced to adjust the sample standard deviation providing a certain confidence interval that the uncertainty quantification is equal to or larger than the actual error with respect to the reference data. However, this uncertainty quantification misses model bias contributions and hence underestimates the error in cases where the lack of training data is the primary error source. Despite that, as also the model variance increases in these cases, this uncertainty quantification can still signal whether the error will be high even if the predicted uncertainty value is not accurate. For applications of MLPs this information is sufficient because small uncertainties can be well predicted and the occurrence of large errors requires in any way an improvement of the calculation which can be achieved by lifelong machine learning.

Perspective of Lifelong Machine Learning Potentials.
On the long-term perspective an lMLP that learns more and more systems over time while the training benefits from prior learned knowledge of chemical interactions is the grand goal. The lMLP learns then in the same way as chemistry evolves, i.e., by systematically expanding prior knowledge. As this work is a proof of concept, the presented lifelong training strategies are naturally not in such an advanced state yet to reach this grand goal. In the short term perspective with the presented algorithms, lMLPs will presumably be trained on data of related systems because the benefit is the largest and the difficulty is the lowest in this case. When the lifelong training strategies as well as the MLP models are developed to a more elaborate state, more and more information can be combined in one lMLP.
This work presents a comprehensive basis of ingredients for lMLPs. Future works need, for example, to develop algorithms for the continual expansion of the model architectures to avoid information capacity limitations of the underlying deep Journal of Chemical Theory and Computation pubs.acs.org/JCTC Article learning representations. For instance, neural network growth can be established by using differently sized architectures of the lMLP ensemble members. By tracking their relative performance the requirement to grow can be identified and individual ensemble members can be adapted. Moreover, coarse-grained structural identifiers like the bin and hash method 83 can be applied to augment the selection and exclusion criteria of training data used for rehearsal. In addition, they can be employed for fast checks whether a certain conformation is represented by similar ones in the training data. From this information unknown and badly represented conformations can be distinguished to avoid redundant reference calculations. Current MLP models are often limited to the representation of a single electronic state. Thus, for example, only a specific total charge and spin multiplicity can often be represented by a single MLP. However, recent works target these limitations for more general MLP models 28,33,84 which can then be used as base model for lMLPs. A current restriction of MLPs is the requirement of consistent training data, i.e., all training data have to be calculated by the same method, basis set, etc. Therefore, MLP models need to be developed that , for example, treat different reference methods in a similar fashion as different electronic states to establish lMLPs learning from data of different reference methods.
Further, open, free, and accessible reference data of published work is obviously highly important. To continue the training of previous work, not only the final weight values but also the associated CoRe optimizer properties of the weights τ, g ξ τ , h ξ τ , s ξ τ , and S ξ τ and the lifelong adaptive data selection properties L old r , S hist r , and X r of every still required training conformation as well as the values L old and p good need to be saved and made available. Moreover, autonomous workflows need to be set up in future work for a user-friendly development of an lMLP with direct interfaces to electronic structure and atomistic simulation software.

COMPUTATIONAL DETAILS
In the HDNNP models of this work, each atomic neural network consisted of an input layer with n G = 153 (eeACSFs) or 156 (ACSFs) neurons, three hidden layers with n 1 = 102, n 2 = 61, and n 3 = 44 neurons, and a single output neuron. The weight initialization is described in Section S1.1 of the Supporting Information. In predictions, the ensemble size N MLP = 10 was applied with the error scaling factor c = 2. We note that the ensemble size was not optimized for individual cases, but it can be in future applications depending on the improvement of value and uncertainty prediction versus additional computational demand.
For the training of each HDNNP, we split the reference data set randomly into 90% training conformations and 10% test conformations. In each training epoch, 10% of all training conformations were used for fitting reference data sets A and B ( Table 2) and 4.07% were employed for fitting reference data set C. The latter yields the same number of trained conformations per epoch for reference data set C compared to B. This counting includes conformations with an adaptive selection factor of S hist r = 0 and conformations which were first added at a late training epoch. To represent the molecular structures of the reference conformations, eeACSF vectors were equipped with the parameters given in Table 1. A larger than usual cutoff radius of R c = 12 Å was applied to avoid issues with electrostatic interactions, which are not in the focus of this work. However, future studies can easily adopt more elaborate schemes for electrostatics like fourth-generation HDNNPs. 28 The supervised training was performed on the total DFT energy minus the sum of the atomic DFT energies of the neutral free atoms in their lowest spin state (see Table S1 in the Supporting Information). All DFT data of S N 2 reactions were obtained for a total charge of −1 e, with e being the elementary charge, and a spin multiplicity of 1. The DFT calculations were performed with the quantum chemistry program ORCA (version 5.0.3). 85,86 The PBE exchangecorrelation functional 87 was chosen with the def2-TZVP basis set 88 and use of the RI-J approximation. In addition, D3 semiclassical dispersion corrections 89 with Becke-Johnson damping 90,91 were applied. To balance the training on the reference DFT energies and forces, the loss function hyperparameter was set to q = 10.9.
For SGD optimizations the best found learning rate of 0.00075 was applied. RPROP optimizations were performed with the hyperparameters recommended in Reference 76. An exception was the initial learning rate, for which 0.001 was found to be the optimal choice in this work. For the Adam optimizer the same hyperparameters were used as in Reference 78 since no general improvement by hyperparameter variations was found.
The hyperparameters of the CoRe optimizer were: β 1 a = 0.45, The lMLP software was written in Python and exploits the scientific computing package NumPy (version 1.21.6) 92 and the machine learning framework PyTorch (version 1.12.1). 93 It is available on Zenodo (DOI: 10.5281/zenodo.7912832) alongside the reference data sets, generated output of this work, and scripts to analyze and plot this output.

Journal of Chemical Theory and Computation pubs.acs.org/JCTC Article
For a fair and reliable comparison of the different MLP trainings, we performed for each setting 20 trainings with different random numbers used in the splitting of training and test data, weight initialization, and training data selection process. For each setting, the best ten MLPs were employed in the analysis to reduce the dependence on the explicit random numbers. In this way, the effect of outliers was minimized that can originate from an unfavorable initial parameter choice, which can affect all optimizers. Thus, the provided means and standard deviations represent the performance for well working cases.
As we employed different splittings of training and test data in the training of ensemble members in this work, the ensemble prediction of the reference data mixes training and test data predictions. While this approach efficiently uses all reference data for training, unbiased validation of the ensemble needs to be performed on additional data.

Reference Data.
For the comparison of the performance of ACSFs and eeACSFs, a reference data set A was constructed containing ten different gas-phase S N 2 reactions. These reactions were represented by 2026 reference conformations and their respective energies and atomic force components obtained from the reference method. The reference conformations were different chemical structures which sampled the conformation space of the S N 2 reactions including the leaving groups X − = Cl − , I − and nucleophiles Y − = Cl − , HCC − , I − for central methyl carbon atoms and tertiary tert-butyl carbon atoms (Figure 1). We note that steric hindrance leads to high energy barriers in S N 2 reactions of tertiary carbon atoms. This set of S N 2 reactions included only four different elements so that the application of ACSFs was still feasible in terms of computational demand. The combinatorial growth of the ACSF vector with the number of elements hampers the application to reference data including more elements.
By contrast, the size of the eeACSF vector stays constant with the number of elements enabling training of reference data containing an arbitrary number of elements. To test eeACSFs, a reference data set B was compiled consisting of Both reference data sets A and B contained conformations obtained in constrained DFT optimizations. The distances carbon-leaving group and nucleophile-carbon were set in the range from 1.05 to 5.25 Å by an irregular sampling approach. A preliminary grid of distance values, which was more dense for smaller distances, was defined with individual minimal distances for each S N 2 reaction system. To obtain the final distance values, random changes on the grid values were applied with the constraint that the values cannot get smaller or larger than adjacent grid values.
To add the representation of structural distortions, 12551 conformations of the S N 2 reaction systems with central methyl carbon atoms were generated by random atomic displacements. For each conformation obtained in constrained DFT optimizations, three new conformations were generated by randomly displacing all atomic positions inside atom-centered spheres with the same radii of 0.05, 0.1, and 0.15 Å. If some interatomic distances turned out to be too small, the process was restarted with another set of random displacements. These conformations together with those of reference data set B formed reference data set C. We emphasize that these reference data were only employed in performance evaluations of lifelong learning and can be insufficient to carry out atomistic simulations because relevant reference conformations can be missing.
The only preprocessing of the reference data sets was a restriction to maximal absolute atomic force components of 15 eV Å −1 to exclude those conformations which only occur under extreme conditions. The energy and atomic force component ranges and standard deviations of the different data sets are compiled in Table 2 and are referred to in performance comparisons in the following sections. The mean energy ranges and standard deviations for the individual S N 2 reaction systems are provided in Table S4 in the Supporting Information.

Element-Embracing Atom-Centered Symmetry Functions.
For four elements the sizes of the optimized ACSF vector (156) and eeACSF vector (153) are similar (see Table  S2 in the Supporting Information for parameters of the ACSFs). Therefore, this number of elements is the turning point at which the eeACSF representation becomes computationally advantageous. Table 3 shows that the representation by ACSFs and eeACSFs of reference data set A yields HDNNPs with similar RMSEs for the test data justifying the structural representation by eeACSFs (see Figures S3 (a) and (b) in the Supporting Information for the prediction error distribution of the ensemble). The accuracy of HDNNPs using ACSFs is on average slightly better. This trend is expected due to the full separation of contributions from different element combinations, while these contributions are mixed in eeACSFs. However, the eeACSF representation is less prone to overfitting according to the given training and test RMSEs. The reason may be that every eeACSF value depends on all neighbor atoms inside the cutoff radius, while an ACSF value

Journal of Chemical Theory and Computation
pubs.acs.org/JCTC Article depends only on certain neighbor atoms. The latter may adjust better to very specific environments but worsens generalization and transferability. The Figures S4 (a) and (b) and S5 in the Supporting Information show that the convergence and training process is similar for ACSF and eeACSF representations.
To train the ten element containing reference data set B, the size of the ACSF vector would be 750 to obtain the same resolution by the parameters η rad , η ang , λ, and ξ. The resulting high computational demand can be prevented by applying eeACSFs with a constant vector size of 153. The accuracy of individual HDNNPs is RMSE(E train ) = (3.9 ± 0.4) meV atom −1 , RMSE(E test ) = (4.5 ± 0.6) meV atom −1 , RMSE(F α,n train ) = (99 ± 7) meV Å −1 , and RMSE(F α,n test ) = (116 ± 4) meV Å −1 . The higher accuracy of the results in Table 3 is a reason of the fewer and less complex reference data to be trained, while the model architecture and training hyperparameters remained unchanged. However, especially the HDNNP ensemble accuracy of RMSE(E) = 2.6 meV atom −1 and RMSE(F α,n ) = 64 meV Å −1 (see Figures S6 (a) and (b) in the Supporting Information for the prediction error distribution) is comparable to other state-of-the-art MLPs trained for less elements 19 and evidence that eeACSFs are able to represent the different local atomic environments including various neighbor elements. Further, the significantly improved accuracy of the HDNNP ensemble compared to that of individual HDNNPs supports the use of an ensemble beyond the access of uncertainty quantification. For performance comparisons of HDNNPs with other MLPs we refer to the References 42 and 64.
The energy RMSE is slightly larger than usual due to the relatively broad energy range to be trained (Table 2). Moreover, the mean energy range for the individual S N 2 reaction systems is also broad with 747 meV atom −1 for those with central methyl carbon atoms and 438 meV atom −1 for those with central tert-butyl carbon atoms. The atomic force component RMSE is somewhat lower than usual despite the broad range because a significant fraction of forces is close to zero.

Continual Resilient (CoRe) Optimizer.
Accurate MLPs can only be obtained if the optimizer can efficiently and reliably find apposite weight values in the high-dimensional parameter space. Training with a fixed learning rate as in SGD yields HDNNPs of poor accuracy. Most RMSE values of SGD results for reference data set B listed in Table 4 are almost an order of magnitude larger than those of CoRe results highlighting the importance of the optimizer. We found RPROP and the Adam optimizer to be the best performing optimizers available in PyTorch 1.12.1 for the given machine learning model and reference data. We note that RPROP is intended for batch learning on all data at once, while we perform stochastic optimization with RPROP. RPROP converges fast and smooth but plateaus at not satisfying accuracy (Figures 2 (a) and (b)). By contrast, the Adam optimizer requires more steps to reach the accuracy of RPROP, but it is able to reach lower RMSE values in the end. However, the convergence is noisy and therefore hampers continual applications.
Our CoRe optimizer converges even faster than RPROP and reaches a better final accuracy than the Adam optimizer (Figures 2 (a) and (b)). The convergence is still almost as smooth as that of RPROP. Hence, CoRe combines and improves the benefits of both, RPROP and Adam. We note that these trends also hold for a random training data selection (see Figures S7 (a) and (b) in the Supporting Information) instead of the lifelong adaptive data selection underlying the results in Figures 2 (a) and (b).

Lifelong Adaptive Data Selection.
Continual data set reduction is essential for lifelong machine learning to keep the amount of training data for rehearsal on a manageable level during incremental learning of new data. Figure 3 reveals how the training data set is narrowed during fitting. The training data reduction sets in after about 600 epochs because for data exclusion the adaptive selection factor S hist r needs to be decreased below S hist min or increased above S hist max by a specified number of consecutive applications of the decrease or increase factors, respectively (Algorithm 2). The optimization with RPROP plateaus already after about 600 epochs. Therefore, the data exclusion is too fast in the subsequent epochs, since the loss contribution of most conformations does not change much, biasing the importance evaluation by the adaptive selection factors. By contrast, the fluctuations in the convergence behavior of the optimizations with the Adam  The optimizers and lifelong adaptive data selection were applied for 1500 (RPROP) and 2000 (SGD, Adam, CoRe) epochs.

Journal of Chemical Theory and Computation
pubs.acs.org/JCTC Article optimizer lead to a slow reduction of the training data. The reason for this is that the data importance measures also undergo the fluctuations which hamper to overcome the exclusion thresholds. The training data reduction of optimizations using CoRe is in between RPROP and Adam yielding a more balanced process. The number of excluded conformations per epoch also reduces for advanced training stages in later epochs making the training more stable (see also Figure S5 in the Supporting Information). By contrast, the optimizations with RPROP become unstable after about 1500 epochs due to the too rapid and strong data reduction.
In the training of reference data set B using the CoRe optimizer, the lifelong adaptive data selection assigned on average (5.5 ± 0.3)·10 3 training conformations to be redundant after 2000 epochs. Therefore, the training data was reduced to 29% of the initial amount. Figures 2 (a) and (b) show that this data reduction does not lead to a decline of the accuracy, which would be expected if training is performed on a nonrepresentative subset of the training conformations. Despite the strong reduction of the amount of data, the lifelong adaptive data selection significantly improves the training accuracy compared to random data selection based on all training data (see Table S5 and Figures S7 (a) and (b) in the Supporting Information). This trend is observed for all optimizers. For the CoRe optimizer lifelong adaptive data selection yields an improvement of the ensemble RMSE values by 31% for the energies and 48% for the atomic force components compared to random data selection.
On average 38 ± 11 conformations were excluded from training because the model was not able to represent these conformations with a high accuracy. In this way, hindrance of the training process by these conformations can be avoided. Since the same 24 conformations were excluded in more than half of the training processes, these conformations are likely to be doubtful. Conformations, which are excluded only by a few individual HDNNPs of the ensemble, can still be predicted by the ensemble average (see Figures S6 (a) and (b) in the Supporting Information). In this way, the individual training processes can be improved, while the generalization of the ensemble prediction is still provided. Arising uncertainty for certain conformations due to their exclusion in some HDNNP trainings is covered by the uncertainty quantification. Therefore, this approach does not affect the reliability of the method. 4.5. Lifelong Machine Learning Potentials. Three frequently occurring example cases are explored, in which lifelong learning can be beneficial in comparison to iterative cycles of data set expansion and constructing new MLPs trained on all data. These cases represent training data completion of a sparsely sampled conformation space, expansion of the represented conformation space for the same chemical systems, and learning additional chemical systems. Lifelong learning can add an arbitrary number of new data points in each training epoch and does not have to be applied in a block-wise scheme as used in conventional active learning for MLPs. In this work, however, we added new data only in a single training epoch for a clear characterization of the resulting effects.
A sparsely sampled conformation space was obtained for the initial training epochs when a high random fraction of the training data was first available at a late epoch. Figures 4 (a) and (b) show that the proposed lifelong learning strategies can handle this case very well yielding an almost constant final lMLP accuracy with respect to the late data fraction p late . The number of epochs, after which the late data fraction was added, was lower for high p late because otherwise the small fraction of initial data would be overfitted. Figures S8 (b) and S9 in the Supporting Information show that the additional data were  added at those epochs where the accuracy of the test atomic force components plateaus or even increases. The mean ensemble accuracy for p late ∈ [0.5, 0.8] is RMSE(E) = 2.6 meV atom −1 and RMSE(F α,n ) = 68 meV Å −1 and hence very similar to that for p late = 0 (Table 4). In addition to the flexibility gained in the training process, the lifelong learning approach requires less training data to be handled in the initial epochs compared to training on all data. Figure S9 in the Supporting Information reveals that the number of excluded conformations by the lifelong adaptive data selection is similar after 1500 epochs for different values of p late .
To examine the performance for the case of the expansion of the represented conformation space for the same chemical systems, an lMLP was first trained on reference data set B for 1250 epochs. Subsequently, the additional structurally distorted conformations of reference data set C were added and training was continued for another 1250 epochs. Table 5 reveals that lifelong learning yields RMSE values for individual HDNNPs, which are about 13% higher than those of learning on a stationary batch of all training data of reference data set C (see Figures S10 (a) and (b) and S11 in the Supporting Information for the training process). However, most of this lost accuracy is regained by the ensemble model which efficiently reduces the increased model variance (see Table 5 and Figures S12 (a) and (b) in the Supporting Information for the prediction error distribution). Hence, the lMLP concept is able to extend the represented conformation space, while it retains the accuracy.
The higher RMSE values obtained in the training of reference data set C compared to B result from the even larger energy range and broader atomic force component distribution (see Table 2 and Table S4 in the Supporting Information), while the model architecture and training hyperparameters remained unchanged. Still, especially the ensemble atomic force component RMSE is similar to other state-of-the-art HDNNPs trained for less elements. 19 To investigate the efficiency for learning additional chemical systems, the system-sorted reference data set B is split into two blocks, whereby the fraction of the second block is p late . The conformations are alphabetically sorted in the order central carbon atom, leaving group, and nucleophile. Learning additional S N 2 reactions for central tert-butyl carbon atoms, i.e., adding nucleophiles and leaving groups at a late epoch which are only known for central methyl carbon atoms, yields a similar accuracy as learning on all data from the beginning (p late < 0.5 in Figures 4 (a) and (b)). For p late ⩾ 0.5 only reactions with central methyl carbon atoms are contained in the initial training data leading to an increase of the final test RMSE values. We emphasize that for p late ⩾ 0.8 also some elements are missing in the initial training data. Still, the accuracy is better than that obtained using RPROP and for the energies it is similar to the Adam results (Table 4).
Similar to the aforementioned case, ensembling can efficiently reduce the model variance introduced by incremental learning and hence is an important tool for lifelong machine learning. The ensemble accuracy is RMSE(E) = 3.1 meV atom −1 and RMSE(F α,n ) = 78 meV Å −1 for p late ∈ [0.5,0.8] and therefore about 21% larger than for p late = 0. However, in these cases of learning additional chemical systems the lifelong learning strategies require further development to be on par with the accuracy of training on all data.
As this work is a proof of concept for lMLPs, more and different chemical systems need to be explored in future work to fine-tune and improve the lifelong learning strategies. Still, we showed that lifelong learning can reach the same accuracy  as training on all data, while the flexibility of the training process was significantly increased.

Ensemble Prediction and Uncertainty Quantification.
To validate the ensemble prediction and examine the uncertainty quantification, the lMLP trained by the CoRe optimizer on reference data set B was applied on conformations obtained from constrained DFT optimizations on a dense grid of r Cl−C and r Br−C distances for the S N 2 reaction Br − + CH 3 Cl ⇌ BrCH 3 + Cl − . We emphasize that reference data set B uses a sparser and irregular sampling of this reaction and the validation conformations are unlikely to be in the reference data set B. Hence, the smoothness of the lMLP potential energy surface can be validated. The validation set contained 1062 conformations with maximal atomic force components of 16 eV Å −1 . Figure 5 reveals that most sections of the represented potential energy surface are predicted within chemical accuracy, i.e., 1 kcal mol −1 = 4.184 kJ mol −1 , with respect to the DFT reference energies. For this six atom system an error of less than 7.2 meV atom −1 is therefore required. Larger errors are only observed for high energy conformations, which we expected because the maximal atomic force component can be 1 eV Å −1 higher than that of the training data. The small errors with smooth distributions in the trained conformation space prove the smoothness of the lMLP.
The reliability of the uncertainty quantification is addressed in Figure 6 (a) and (b). The uncertainty quantification is equal to or larger than the absolute error with respect to the DFT reference energies for 98.6% of the validation data with uncertainties ΔE ⩽ 10 meV atom −1 . For the atomic force components the fraction is 99.7% for ΔF α,n ⩽ 250 meV Å −1 . Hence, for most conformations the magnitude of the error is reliably predicted. For large errors the uncertainty quantification is expected to underestimate the errors (see Section 2.6) so that the above-mentioned fractions decrease to 66% for ΔE > 10 meV atom −1 and 93% for ΔF α,n > 250 meV Å −1 . Still, a reliable identification of large errors is provided.

CONCLUSIONS
This work introduces the concept of lMLPs that can fine-tune and extend their representation in a rolling fashion. Hence, the lMLP concept unites MLP model efficiency and accuracy with flexibility. For an lMLP, a universal and computationally efficient atomic structure representation, an MLP model, uncertainty quantification, and lifelong learning strategies need to be combined. Therefore, we introduced eeACSF vectors for the structural representation, which are size-independent with respect to the number of chemical elements in contrast to many other common MLP descriptors. Their representation performance of an S N 2 reference data set is similar to that of ACSF vectors at the break-even point of computational cost, which is at about four different elements. For several more elements eeACSF vectors become the only computationally reasonable option due to the combinatorial growth of ACSF vectors. An ensemble HDNNP model using eeACSFs can predict an S N 2 reference data set with ten different elements with a state-ofthe-art accuracy of previous HDNNPs trained on less elements. Further, an ensemble of HDNNPs is a reliable way to quantify the uncertainty due to model variance and to identify conformations with high uncertainty in predictions. Additionally, ensembling increases the accuracy yielding potential energy surfaces with chemical accuracy for the S N 2 reactions.  As a basis of our lifelong learning strategies, we introduced the CoRe optimizer, which can combine and improve the fast convergence of RPROP and the high final accuracy of the Adam optimizer. In the training of this work, the CoRe optimizer significantly improves the HDNNP ensemble accuracy by about 33% for energies and forces compared to the Adam optimizer. Applying lifelong adaptive data selection further improves the accuracy and enables to narrow the training data set and exclude doubtful data during the training process. The CoRe optimizer and lifelong adaptive data selection can also improve training of machine learning models beyond lMLPs.
Finally, an lMLP can adapt to additional data that can be continuously added at any point in the training process. In this way, improvements of lMLPs are possible without learning again on all previous data and still a reliable method is obtained due to uncertainty quantification. In learning cases that are obtained during active learning or in the extension of the conformation space for the same reaction systems, the training accuracy is similar to that of learning on a stationary batch of all data. Even adding new reaction systems can be performed by the presented algorithms with only moderate accuracy loss. The benefit of lifelong learning is the enhanced flexibility of the training process enabling rolling explorations of chemical reactivity and training continuation of previous lMLPs. Moreover, adaptability of lMLPs is especially advantageous for large reference data sets where training on all data at once is computationally very demanding. We emphasize that the lMLP concept can also be applied for other MLP approaches beyond HDNNPs in future work. ■ ASSOCIATED CONTENT
Activation function, weight initialization, DFT free atom energies, ACSF parameters, performance evaluation of activation function and weight initialization, energy distribution of reference data, prediction error distributions, training process comparison of ACSFs and eeACSFs, performance evaluation of random data selection, and lMLP training processes (PDF)